ReduSpatial-master folder that you downloaded).File >> New File >> R Script (or Ctrl + Shift + N keyboard shortcut).Materials in the course will mix R code, R output maps and graphs with some commentary in between.
Most importantly R code will appear in “code blocks” - grey boxes with colour coded syntax (highlighting elements of code, depending on their function) and R output will appear below them, also surrounded by boxes but without background:
Use the code from materials and copy paste it to your scripts. Use the output in materials to compare results inside your RStudio console or “Plots” or “Viewer” windows.
We will start our script with loading set of libraries that will allow us to read, manipulate and visualize spatial data. For the moment you can ignore the output.
if (!require("tidyverse")) install.packages("tidyverse"); library("tidyverse")## Loading required package: tidyverse
## -- Attaching packages ----------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.8.0.1
## v tidyr 0.8.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts -------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if (!require("sf")) install.packages("sf"); library("sf")## Loading required package: sf
## Linking to GEOS 3.6.1, GDAL 2.2.3, PROJ 4.9.3
if (!require("tmap")) install.packages("tmap"); library("tmap")## Loading required package: tmap
if (!require("tmaptools")) install.packages("tmaptools"); library("tmaptools")## Loading required package: tmaptools
if (!require("sjmisc")) install.packages("sjmisc"); library("sjmisc")## Loading required package: sjmisc
##
## Attaching package: 'sjmisc'
## The following object is masked from 'package:purrr':
##
## is_empty
## The following object is masked from 'package:tidyr':
##
## replace_na
## The following object is masked from 'package:tibble':
##
## add_case
We will work with four packages
tidyverse - set of packages for data manipulation and graphs (contains ggplot and dplyr)sf simple features package to handle geographical datatmap mapping librarysjmisc package for tabulating & summarizing dataAn important piece of information when using spatial data is the geography to which it refers. ABS data are disseminated based on a number of geographies classified according to the Australian Statistical Geography Standard (ASGS).
ASGS ABS Structures. Source: Australian Bureau of Statistics.
Mesh Blocks (MBs) are the smallest geographical areas defined. Due to confidentiality concerns however, data are more commonly analysed using units that are higher up the hierarchy - the Statistical Areas. Statistical Areas Level 2 (SA2s) are a common choice for small area analyses and ABS describes them as:
designed to reflect functional areas that represent a community that interacts together socially and economically. They consider Suburb and Locality boundaries to improve the geographic coding of data to these areas and in major urban areas SA2s often reflect one or more related suburbs. The SA2 is the smallest area for the release of many ABS statistics, including the Estimated Resident Population (ERP), Health & Vitals and Building Approvals data. SA2s generally have a population range of 3,000 to 25,000 persons, and have an average population of about 10,000 persons.
You can find more information about ASGS here. You can see how different geographies used by ABS fit (or not!) into each other on this interactive map.
We will work with a subset of 2016 SA2 areas created by ABS. sf package function st_read will read data from a SA2_2016_MELB.shp file. After reading we will set up its coordinate system correctly with st_set_crs.
SA2_2016_MELB <- st_read("./data/SA2_2016_MELB.shp") %>%
st_set_crs(4283)## Reading layer `SA2_2016_MELB' from data source `C:\Users\uqrpancz\Google Drive\edu\ReduSpatial\data\SA2_2016_MELB.shp' using driver `ESRI Shapefile'
## Simple feature collection with 309 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 144.3336 ymin: -38.50299 xmax: 145.8784 ymax: -37.1751
## epsg (SRID): NA
## proj4string: +proj=longlat +ellps=GRS80 +no_defs
Pay attention to special characters used in the code block:
<- as “assignment”" to create SA2_2016_MELB object%>% to pass the object for further opearations, in our case - defining a projection; easy way to think about pipes is to read them to yourself as “then”You can easily recognize functions coming from sf package with a st_ prefix. Remaining part tries to describe what a function does, most likely using a verb.
Our input data comes from SA2_2016_MELB.shp file. This is a shapefile which is a common format for storing and sharing geospatial data. Shapefile is actually a collection of minimum three files. Take a look into .\ReduSpatial\data directory and search for it. Read more about shapefile on the Wikipedia page.
Lets create first map. We will use tmap library to help us out. We first set up “mode” to “view (which means interactive map”):
tmap_mode("view")## tmap mode set to interactive viewing
Then, use qtm function to show boundaries of SA2 areas:
qtm(SA2_2016_MELB)This is a simple map representing Statistical Area Level 2 (SA2) polygons overlaid on some background maps. Click around your “Viewer” panel - this is an interactive map that allows you zooming in and out, identify region and change background maps. See if you can find buttons that let you export the map, move it to spearate window or change background maps.
Let’s check what kind of geometry we have in our data:
table(droplevels(st_geometry_type(SA2_2016_MELB)))##
## MULTIPOLYGON
## 309
And what type of projection system it is in:
st_crs(SA2_2016_MELB)## Coordinate Reference System:
## EPSG: 4283
## proj4string: "+proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs"
Our is Coordinate Reference System has a code EPSG: 4283. That means our data are using Geocentric_Datum_of_Australia_1994 (more (details)[https://epsg.io/4283]).
Let’s use few R functions to check how the dataset is structured. We can use functions of R to dexcribe how many rows dataset has:
nrow(SA2_2016_MELB)## [1] 309
What are the names of the data:
names(SA2_2016_MELB)## [1] "SA2_MAIN16" "SA2_NAME16" "SA4_CODE16" "SA4_NAME16" "geometry"
Show first five rows:
slice(SA2_2016_MELB, 1:5)## Simple feature collection with 5 features and 4 fields
## geometry type: MULTIPOLYGON
## dimension: XY
## bbox: xmin: 144.9267 ymin: -37.78021 xmax: 144.9869 ymax: -37.73251
## epsg (SRID): 4283
## proj4string: +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
## SA2_MAIN16 SA2_NAME16 SA4_CODE16 SA4_NAME16
## 1 206011105 Brunswick 206 Melbourne - Inner
## 2 206011106 Brunswick East 206 Melbourne - Inner
## 3 206011107 Brunswick West 206 Melbourne - Inner
## 4 206011108 Coburg 206 Melbourne - Inner
## 5 206011109 Pascoe Vale South 206 Melbourne - Inner
## geometry
## 1 MULTIPOLYGON (((144.9497 -3...
## 2 MULTIPOLYGON (((144.9734 -3...
## 3 MULTIPOLYGON (((144.9341 -3...
## 4 MULTIPOLYGON (((144.9485 -3...
## 5 MULTIPOLYGON (((144.9326 -3...
And examine structure of data in more detail:
str(SA2_2016_MELB)## Classes 'sf' and 'data.frame': 309 obs. of 5 variables:
## $ SA2_MAIN16: num 2.06e+08 2.06e+08 2.06e+08 2.06e+08 2.06e+08 ...
## $ SA2_NAME16: Factor w/ 309 levels "Abbotsford","Airport West",..: 37 38 39 68 221 4 212 282 10 108 ...
## $ SA4_CODE16: Factor w/ 9 levels "206","207","208",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ SA4_NAME16: Factor w/ 9 levels "Melbourne - Inner",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ geometry :sfc_MULTIPOLYGON of length 309; first list element: List of 1
## ..$ :List of 1
## .. ..$ : num [1:168, 1:2] 145 145 145 145 145 ...
## ..- attr(*, "class")= chr "XY" "MULTIPOLYGON" "sfg"
## - attr(*, "sf_column")= chr "geometry"
## - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA
## ..- attr(*, "names")= chr "SA2_MAIN16" "SA2_NAME16" "SA4_CODE16" "SA4_NAME16"
We can plot our date and overlay them with other maps but there is very little information contained there apart from the boundaries and their names. We will link other data to our polygons that will let us describe places.
We will work with data from Austrlian Bureau of Statistics (ABS) that describe geographical areas using set of Census 2016 indicators. We will use Socio-Economic Indexes for Areas (SEIFA):
an ABS product that ranks areas in Australia according to relative socio-economic advantage and disadvantage. SEIFA 2016 has been created from Census 2016 data and consists of four indexes: The Index of Relative Socio-economic Disadvantage (IRSD); The Index of Relative Socio-economic Advantage and Disadvantage (IRSAD); The Index of Education and Occupation (IEO); The Index of Economic Resources (IER).
More information and the data can be found here.
We will load one more dataset. This dataset is not a spatial dataset and is saved in R’s Rds format so we can read it using readRDS function.
SEIFA <- readRDS("data/SEIFA.Rds")Lets again examine the data using thesame functions we used to learn more about SA2 areas (output omitted):
nrow(SEIFA)
names(SEIFA)
slice(SEIFA, 1:5)
str(SEIFA)Instead of looking at the whole data frame, we can select individual variables. We use descr function from sjmisc package to return few descriptive statistics on IRSAD score:
descr(SEIFA, IRSAD_s)##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 2184 0.32 997.49 84.42 1.81 997 999.38
## range skew
## 580 (604-1184) -0.43
and deciles:
descr(SEIFA, IRSAD_d)##
## ## Basic descriptive statistics
##
## var type label n NA.prc mean sd se md trimmed range
## IRSAD_d integer IRSAD_d 2184 0.32 5.5 2.87 0.06 5.5 5.5 9 (1-10)
## skew
## 0
It might be more informative to look at the frequencies of deciles instead of summary statistics:
frq(SEIFA, IRSAD_d)##
## # IRSAD_d <integer>
## # total N=2191 valid N=2184 mean=5.50 sd=2.87
##
## val frq raw.prc valid.prc cum.prc
## 1 218 9.95 9.98 9.98
## 2 218 9.95 9.98 19.96
## 3 219 10.00 10.03 29.99
## 4 218 9.95 9.98 39.97
## 5 219 10.00 10.03 50.00
## 6 218 9.95 9.98 59.98
## 7 219 10.00 10.03 70.01
## 8 218 9.95 9.98 79.99
## 9 219 10.00 10.03 90.02
## 10 218 9.95 9.98 100.00
## <NA> 7 0.32 NA NA
Finally, using pipes and dplyr functions to sort and group data we can combine two variables together and look at descriptive statistics of score for each decile if IRSAD:
SEIFA %>%
arrange(IRSAD_d) %>%
group_by(IRSAD_d) %>%
descr(IRSAD_s)##
## ## Basic descriptive statistics
##
## Grouped by:
## IRSAD_d: 1
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 845.33 59.66 4.04 867.5 857.74
## range skew
## 294 (604-898) -1.97
##
## Grouped by:
## IRSAD_d: 2
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 914.36 8.71 0.59 915 914.48
## range skew
## 31 (898-929) -0.11
##
## Grouped by:
## IRSAD_d: 3
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 219 0 940.38 6.93 0.47 939 940.28
## range skew
## 24 (929-953) 0.13
##
## Grouped by:
## IRSAD_d: 4
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 963.53 6 0.41 963 963.45
## range skew
## 22 (953-975) 0.12
##
## Grouped by:
## IRSAD_d: 5
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 219 0 986.04 6.61 0.45 985 986.07
## range skew
## 22 (975-997) -0.01
##
## Grouped by:
## IRSAD_d: 6
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 1008.85 7.18 0.49 1009 1008.73
## range skew
## 25 (997-1022) 0.1
##
## Grouped by:
## IRSAD_d: 7
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 219 0 1034.17 7.03 0.48 1035 1034.29
## range skew
## 24 (1022-1046) -0.2
##
## Grouped by:
## IRSAD_d: 8
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 1058.89 7.74 0.52 1058 1058.87
## range skew
## 27 (1046-1073) 0.04
##
## Grouped by:
## IRSAD_d: 9
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 219 0 1088.58 10.01 0.68 1088 1088.31
## range skew
## 33 (1073-1106) 0.24
##
## Grouped by:
## IRSAD_d: 10
## var type label n NA.prc mean sd se md trimmed
## IRSAD_s numeric IRSAD_s 218 0 1134.52 19.35 1.31 1130 1133.14
## range skew
## 77 (1107-1184) 0.59
A picture is (sometimes) worth a thousand words so it might be easier to look at your variables and relation between them using graphs. Lets look at the histogram of score:
ggplot(SEIFA, aes(IRSAD_s)) + geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 7 rows containing non-finite values (stat_bin).
Here is the bar chart showin frequency of data across deciles:
ggplot(SEIFA, aes(IRSAD_d)) + geom_bar()## Warning: Removed 7 rows containing non-finite values (stat_count).
And lastly, box plot describing score across deciles:
ggplot(SEIFA, aes(as.factor(IRSAD_d), IRSAD_s)) + geom_boxplot()## Warning: Removed 7 rows containing non-finite values (stat_boxplot).
Pay attention to warning! They do not stop the execution of your code but provide you with some hints when uexpected behaviour. Use View(SEIFA) to open data frame and browse through it. Can you spot the problem that caused the warning?
We have now two separate objects in our R session. You can see them in your “Environment” tab. We will now join them together using left_join function from dplyr package:
SA2_SEIFA <- left_join(SA2_2016_MELB, SEIFA, by = c("SA2_MAIN16", "SA2_MAIN16"))SA2_2016_MELB simple features object can be manipulated in the same way as every data frame. We merged data from SEIFA data frame using common variable SA2_MAIN16. Technically we wouldn’t have to specify the “key” used to merge data because it has the same name in both sources but you might need that functionalty if names are different.
Use nrow function (or peek into your “Environment” tab) to check how many rows each of our datasets has now. Use help(left_join) command (or manual search) in your “Help” panel to find help about the left_join function. Can you realte it to the output of nrow?
One important thing to check before any further work with merged data is to check if your data has missing information. We can do it by tabulating output of is.na function (“NA” stands for missing data):
table(is.na(SA2_SEIFA$IRSAD_s))##
## FALSE TRUE
## 305 4
We have some areas with missing information. filtering them to a qtm map can provide some insights about why there are missing:
SA2_SEIFA %>%
filter(is.na(IRSAD_s)) %>%
qtm(fill = "red",
fill.alpha = 0.5, border = NULL)Can you guess why theseareas do not have SEIFA scores?
We put a lot of work to get data in good shape. They will be used in second part of tutorial so we save them using saveRDS function:
saveRDS(SA2_SEIFA, "data/SA2_SEIFA.Rds")Examine SEIFA indices for Melbourne in the same manner as you did for the whole Austrlia (section “Examine data”). What differences do you see? How could you explain them?
Examine relationships between two different indices for Autralia or Melbourne. What kind of graphs and statistics could be useful here?